One man’s wilderness is another man’s theme park. (Author unknown)
7 sins
This article analyzes the seven major felonies that occurred in New York City over the last ten years (2006-2015). The exploration begins with preprocessing of the dataset and then dissects it from two major perspectives: when and where.
The NYPD 7 Major Felony Incidents dataset records the seven major felonies at the incident level and is updated quarterly. It was made public on December 29, 2015, and is available here.
According to the NYPD Incident Level Data Footnotes, four points are worth mentioning:
Crime complaints which involve multiple offenses are classified according to the most serious offense.
For privacy reasons, incidents have been moved to the midpoint of the street segment on which they occur.
Attempted crimes are recorded as if the crime actually occurred.
Data presented here are based on the year the incident was reported, not necessarily when it occurred. The result is that some crimes listed here were reported during this time, but may have occurred in a previous year.
The first point indicates that the number of actual offenses is larger than the number recorded in the dataset. Since we know nothing about which types of offenses are typically associated with one another in multi-offense incidents, we make no assumptions about them. The second point affects the accuracy of incident locations, but at the borough or city scale the inaccuracy in longitude and latitude will not have a major impact on the overall distribution of incidents on a map, since we are more interested in aggregate metrics such as the total number and density of murders in Manhattan.
# Questions
Safety is one of the most basic human needs. No one wants to be involved in an incident in NYC or to feel unsafe. As such, I mainly focus on answering the following questions:
Is NYC becoming a safer city over the last 10 years?
Which months of the year can be considered unsafe?
Which days of the week can be considered unsafe?
Which hours of the day can be considered unsafe?
According to the historical data, which places are less safe than others?
The dataset is 194 MB in size and contains around 1.1 million rows with 22 variables.
# install.packages("ff")
library(ff)
library(dplyr)
setwd("/Users/sundeepblue/Bootcamp/projects/explore_dataset")
# since the csv has around 1.1 million lines, we use ffdf for fast loading
t = read.csv.ffdf(file="NYPD_7_Major_Felony_Incidents.csv", header=TRUE,
VERBOSE=TRUE, first.rows=10000, next.rows=50000, colClasses=NA)
## read.table.ffdf 1..10000 (10000) csv-read=0.945sec ffdf-write=0.286sec
## read.table.ffdf 10001..60000 (50000) csv-read=6.088sec ffdf-write=0.591sec
## read.table.ffdf 60001..110000 (50000) csv-read=4.642sec ffdf-write=0.593sec
## read.table.ffdf 110001..160000 (50000) csv-read=4.623sec ffdf-write=0.818sec
## read.table.ffdf 160001..210000 (50000) csv-read=4.803sec ffdf-write=0.643sec
## read.table.ffdf 210001..260000 (50000) csv-read=4.685sec ffdf-write=1.491sec
## read.table.ffdf 260001..310000 (50000) csv-read=4.746sec ffdf-write=0.964sec
## read.table.ffdf 310001..360000 (50000) csv-read=4.706sec ffdf-write=1.361sec
## read.table.ffdf 360001..410000 (50000) csv-read=4.435sec ffdf-write=1.458sec
## read.table.ffdf 410001..460000 (50000) csv-read=5.038sec ffdf-write=1.092sec
## read.table.ffdf 460001..510000 (50000) csv-read=4.586sec ffdf-write=1.64sec
## read.table.ffdf 510001..560000 (50000) csv-read=4.563sec ffdf-write=1.799sec
## read.table.ffdf 560001..610000 (50000) csv-read=5.069sec ffdf-write=3.235sec
## read.table.ffdf 610001..660000 (50000) csv-read=5.324sec ffdf-write=4.327sec
## read.table.ffdf 660001..710000 (50000) csv-read=5.503sec ffdf-write=2.491sec
## read.table.ffdf 710001..760000 (50000) csv-read=5.354sec ffdf-write=2.336sec
## read.table.ffdf 760001..810000 (50000) csv-read=4.343sec ffdf-write=2.305sec
## read.table.ffdf 810001..860000 (50000) csv-read=4.543sec ffdf-write=2.739sec
## read.table.ffdf 860001..910000 (50000) csv-read=4.566sec ffdf-write=2.128sec
## read.table.ffdf 910001..960000 (50000) csv-read=4.712sec ffdf-write=4.546sec
## read.table.ffdf 960001..1010000 (50000) csv-read=7.274sec ffdf-write=3.014sec
## read.table.ffdf 1010001..1060000 (50000) csv-read=5.362sec ffdf-write=3.839sec
## read.table.ffdf 1060001..1110000 (50000) csv-read=6.506sec ffdf-write=4.333sec
## read.table.ffdf 1110001..1123465 (13465) csv-read=1.204sec ffdf-write=1.925sec
## csv-read=113.62sec ffdf-write=49.954sec TOTAL=163.574sec
# convert ffdf type to data.frame for later conversion
t = as.data.frame(t)
# convert some columns with unnecessary factor type into character type
t$Identifier = as.character(t$Identifier)
t$Occurrence.Date = as.character(t$Occurrence.Date)
t$Location.1 = as.character(t$Location.1)
# we are only interested in offenses in the last 10 years [2006, 2015]
t_2006_to_2015 = filter(t, Occurrence.Year >= 2006)
# ignore observations with missing value
t_2006_to_2015 = na.omit(t_2006_to_2015)
# re-level the levels of factor variable "Day.of.Week"
t_2006_to_2015$Day.of.Week = factor(t_2006_to_2015$Day.of.Week,
levels=c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"));
# define 3 helper functions for string manipulation
trim_spaces = function(x) { gsub("^\\s+|\\s+$", "", x) }
trim_first_char = function(x) { gsub('^.', '', x) }
trim_last_char = function(x) { gsub('.$', '', x) }
# parse the column "Location.1"
loc_strings = t_2006_to_2015$Location.1
L = sapply(loc_strings, strsplit, split=',')
L = sapply(L, unlist)
loc_x = sapply(L[1,], trim_first_char)
loc_y = sapply(L[2,], trim_last_char)
# add two columns "loc_x" and "loc_y"
t_2006_to_2015$loc_x = as.numeric(loc_x)
t_2006_to_2015$loc_y = as.numeric(loc_y)
# write.table(t_2006_to_2015, file="NYPD_7_FELONIES_2006_2015.csv", sep=",")
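As a sanity check, the parsing above can be traced on a single sample coordinate string (the helpers are redefined here so the sketch is self-contained). Note that `loc_x` holds latitude and `loc_y` longitude, since the "Location.1" string lists latitude first:

```r
# Self-contained sketch: trace the "Location.1" parsing on one sample string.
trim_spaces     <- function(x) gsub("^\\s+|\\s+$", "", x)
trim_first_char <- function(x) gsub("^.", "", x)  # drops the leading "("
trim_last_char  <- function(x) gsub(".$", "", x)  # drops the trailing ")"

s     <- "(40.6997596520001, -73.8385879319999)"
parts <- unlist(strsplit(s, split = ","))
lat   <- as.numeric(trim_first_char(parts[1]))              # latitude, 40.69976...
lon   <- as.numeric(trim_last_char(trim_spaces(parts[2])))  # longitude, -73.83859...
```

This is why the "ggmap" section later swaps the columns with `mutate(lon = loc_y, lat = loc_x)`.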
After preprocessing, we load the processed dataset as follows. First, we load several libraries.
library(ff)
library(ggplot2)
library(dplyr)
library(grid)
library(gridExtra)
library(leaflet)
library(ggmap)
require(cowplot)
Then let us read the pre-saved dataset into memory.
setwd("/Users/sundeepblue/Bootcamp/projects/explore_dataset")
# tb = read.csv('NYPD_7_FELONIES_2006_2015.csv', header=T, sep=',',
# stringsAsFactors=F)
tb = t_2006_to_2015
str(tb)
## 'data.frame': 1117256 obs. of 22 variables:
## $ OBJECTID : int 258 259 260 261 262 263 264 265 266 267 ...
## $ Identifier : chr "13b6949b" "c9cc0e6d" "b6c31b82" "2598f1c7" ...
## $ Occurrence.Date : chr "01/09/2006 12:00:00 AM" "01/09/2006 12:00:00 AM" "01/09/2006 12:00:00 AM" "01/09/2006 12:00:00 AM" ...
## $ Day.of.Week : Factor w/ 7 levels "Monday","Tuesday",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Occurrence.Month : Factor w/ 13 levels "Apr","Aug","Dec",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Occurrence.Day : int 9 9 9 9 9 9 9 9 9 9 ...
## $ Occurrence.Year : int 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
## $ Occurrence.Hour : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CompStat.Month : int 8 1 2 1 1 1 6 8 1 1 ...
## $ CompStat.Day : int 16 9 22 9 19 9 28 31 23 9 ...
## $ CompStat.Year : int 2006 2006 2006 2006 2006 2006 2006 2007 2006 2006 ...
## $ Offense : Factor w/ 7 levels "BURGLARY","FELONY ASSAULT",..: 3 3 3 1 1 1 3 3 4 1 ...
## $ Offense.Classification: Factor w/ 1 level "FELONY": 1 1 1 1 1 1 1 1 1 1 ...
## $ Sector : Factor w/ 30 levels "","0","1","2",..: 19 16 14 12 14 21 16 19 18 17 ...
## $ Precinct : int 102 1 24 49 49 52 72 75 104 10 ...
## $ Borough : Factor w/ 7 levels "BRONX","BROOKLYN",..: 4 3 3 1 1 1 2 2 4 3 ...
## $ Jurisdiction : Factor w/ 25 levels "DEPT OF CORRECTIONS",..: 6 6 6 6 6 6 6 6 6 13 ...
## $ XCoordinate : int 1029007 979426 992387 1020746 1028121 1016306 984147 1018380 1014406 985325 ...
## $ YCoordinate : int 194256 199624 228450 246675 251461 257760 175275 184864 198825 215233 ...
## $ Location.1 : chr "(40.6997596520001, -73.8385879319999)" "(40.7146054090001, -74.017402787)" "(40.7937230380001, -73.9706144319999)" "(40.843673874, -73.868096037)" ...
## $ loc_x : num 40.7 40.7 40.8 40.8 40.9 ...
## $ loc_y : num -73.8 -74 -74 -73.9 -73.8 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:17] 17334 30414 1112868 1112883 1113049 1113064 1113233 1113320 1113400 1113682 ...
## .. ..- attr(*, "names")= chr [1:17] "17334" "30414" "1112868" "1112883" ...
We explore this dataset from two perspectives: when and where. The “when” concerns how years, months, days of the week, and hours of the day correlate with offenses. The “where” concerns where the incidents occurred.
t_total_count_by_year_and_offense = tb %>% group_by(Occurrence.Year, Offense) %>%
summarise(count = n())
ggplot(data = t_total_count_by_year_and_offense, aes(x = as.numeric(Occurrence.Year),
    y = count)) + ggtitle("Number of offense incidents over the last 10 years (2006-2015)") +
    xlab("Year (2006 to 2015)") + ylab("Number of offenses") + geom_point(aes(color = Offense),
    size = 5) + geom_line(aes(color = Offense), size = 2)
We can learn from the figure that:
Grand larceny is the most frequent of the 7 offenses, although the number of incidents it caused went down in 2015.
There is no clear trend showing that felony assault is declining.
The numbers of all remaining offenses are declining.
Overall, NYC is becoming safer than before.
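To back the “safer overall” claim numerically rather than reading it off the plot, one could collapse all offenses into a single yearly total. A minimal sketch with toy data standing in for `tb` (the counts are hypothetical, not real NYPD figures):

```r
# Toy data standing in for tb$Occurrence.Year (hypothetical counts).
toy_years    <- c(2006, 2006, 2006, 2007, 2007, 2008)
yearly_total <- table(toy_years)
yearly_total[["2006"]]  # 3 incidents in the toy 2006 data
```

With the real data, the same `table(tb$Occurrence.Year)` call would give one total per year to compare directly.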
Now let us see how offense incidents are distributed across the 12 months.
ggplot(data = tb, aes(x = Occurrence.Month, color = Offense)) + geom_bar(aes(fill = Offense)) +
    ggtitle("Number of offense incidents by month over all 10 years") +
    xlab("Month (January to December)") + ylab("Number of offenses") + theme(legend.position = "bottom") +
    facet_wrap(~Offense, nrow = 2)
What we can learn:
Surprisingly, the number of incidents in April is the lowest for all offenses except murder.
August appears to have the second-lowest overall incident count.
What does it look like by day of week?
ggplot(data = tb, aes(x = Day.of.Week, color = Offense)) + geom_bar(aes(fill = Offense)) +
    ggtitle("Number of offense incidents by day of week over 10 years") +
    xlab("Day of week") + ylab("Number of offenses") + theme(legend.position = "bottom") +
    facet_wrap(~Offense, nrow = 1)
Insight: Burglary, grand larceny, grand larceny of motor vehicle, and robbery occur most frequently on Friday. This pattern is clearly visible.
Insight: Rape and murder occur more often on weekends.
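The weekend effect can also be checked numerically instead of visually. A sketch with toy data standing in for `tb` (the offense labels are real, the counts are hypothetical):

```r
# Toy data standing in for tb: fraction of incidents falling on a weekend,
# per offense. With the real data, replace `toy` by tb.
toy <- data.frame(
  Offense     = c("RAPE", "RAPE", "RAPE", "BURGLARY", "BURGLARY", "BURGLARY"),
  Day.of.Week = c("Saturday", "Sunday", "Monday", "Friday", "Friday", "Tuesday"))
toy$is_weekend <- toy$Day.of.Week %in% c("Saturday", "Sunday")
weekend_share  <- tapply(toy$is_weekend, toy$Offense, mean)
weekend_share[["RAPE"]]  # 2 of the 3 toy rape incidents fall on a weekend
```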
And how do incidents distribute across the hours of the day?
# This graph shows tons of insights. TODO: add a label for each subplot
# ('middle night', 'early morning', 'morning', 'noon', 'afternoon',
# 'evening', 'late night')
ggplot(data = tb, aes(x = Occurrence.Hour)) + geom_density(aes(color = Offense)) +
    theme(aspect.ratio = 1) + ggtitle("Density of offense incidents across the 24 hours of the day over 10 years") +
    xlab("Hour") + ylab("Density of offenses") + theme(legend.position = "bottom") +
    facet_wrap(~Offense, nrow = 1)
The density curves suggest that each offense has its own daily rhythm. To get more detailed information, we need to look at the histogram and identify the most apparent features.
ggplot(data = tb, aes(x = Occurrence.Hour)) + geom_histogram(aes(fill = Offense),
    binwidth = 1) + ggtitle("Histogram of offense incidents across the 24 hours of the day over 10 years") +
    xlab("Hour") + ylab("Number of offenses") + theme(legend.position = "bottom") +
    facet_wrap(~Offense, nrow = 1)
The histogram makes the hourly peaks of each offense easier to compare than the smoothed density.
Next, we perform a spatial analysis to reveal the distribution of offenses across boroughs. The dataset footnotes also mention several geo-coding caveats:
Crimes occurring anywhere other than at an intersection are represented by a midpoint X coordinate and a midpoint Y coordinate (center of block).
Rape offenses are geo-coded as occurring at the police station house within the precinct of occurrence.
Offenses that lack an X coordinate and Y coordinate are geo-coded as occurring at the police station house within the precinct of occurrence.
Offenses occurring in open areas such as parks or beaches may be geo-coded as occurring on streets or intersections bordering the area.
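One practical consequence of station-house geo-coding is that many distinct incidents share identical coordinates, which can distort density estimates. A toy sketch (hypothetical coordinates, not real data) of how such repeated points could be counted before mapping:

```r
# Toy coordinates: two incidents geo-coded to the same station house.
pts <- data.frame(lat = c(40.71, 40.71, 40.80),
                  lon = c(-74.01, -74.01, -73.95))
sum(duplicated(pts))  # 1 repeated coordinate pair
```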
Now we analyze offenses by borough in 2015. Since we will use the “ggmap” library to visualize maps, we first need to add two columns, “lon” and “lat”:
tb = tb %>% mutate(lon = loc_y, lat = loc_x)
ggplot(data = (tb %>% filter(Occurrence.Year == 2015)), aes(x = Offense)) +
geom_bar(aes(fill = Offense)) + facet_wrap(~Borough, nrow = 1)
Now let’s draw an interactive map:
t_MANHATTAN_2015 = tb %>% filter(Borough=="MANHATTAN", Occurrence.Year=="2015")
leaflet(t_MANHATTAN_2015, height=300, width=900) %>%
addTiles() %>% # Add default OpenStreetMap map tiles
addMarkers(lng=t_MANHATTAN_2015$loc_y, lat=t_MANHATTAN_2015$loc_x,
clusterOptions = markerClusterOptions())
Let’s first define a function to draw a density map of offenses for a given borough.
draw_density_map_of_offenses_for_borough = function(area_map_obj, tb) {
ggmap(area_map_obj, base_layer = ggplot(aes(x = lon, y = lat), data = tb)) +
stat_density2d(aes(x = lon, y = lat), data = tb, color = "blue", alpha = 0.5) +
stat_density2d(aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
bins = 7, geom = "polygon", alpha = 0.3, contour = T, data = tb) +
scale_fill_gradient(low = "green", high = "red") + facet_wrap(~Offense,
nrow = 2)
}
Draw the density map for Manhattan:
area_map_obj = get_map("MANHATTAN NYC", zoom = 13, maptype = "toner-background",
source = "stamen")
t_manhattan = tb %>% filter(Borough == "MANHATTAN", Occurrence.Year == "2015") %>%
mutate(lon = loc_y, lat = loc_x)
draw_density_map_of_offenses_for_borough(area_map_obj, t_manhattan)
Draw the density map for Queens:
area_map_obj = get_map("QUEENS NYC", zoom = 11, maptype = "toner-background",
source = "stamen")
t_queens = tb %>% filter(Borough == "QUEENS", Occurrence.Year == "2015") %>%
mutate(lon = loc_y, lat = loc_x)
draw_density_map_of_offenses_for_borough(area_map_obj, t_queens)
Draw the density map for the Bronx:
area_map_obj = get_map("BRONX NYC", zoom = 12, maptype = "toner-background",
source = "stamen")
t_bronx = tb %>% filter(Borough == "BRONX", Occurrence.Year == "2015") %>% mutate(lon = loc_y,
lat = loc_x)
draw_density_map_of_offenses_for_borough(area_map_obj, t_bronx)
Draw the density map for Brooklyn:
area_map_obj = get_map("Brooklyn NYC", zoom = 12, maptype = "toner-background",
source = "stamen")
t_brooklyn = tb %>% filter(Borough == "BROOKLYN", Occurrence.Year == "2015") %>%
mutate(lon = loc_y, lat = loc_x)
draw_density_map_of_offenses_for_borough(area_map_obj, t_brooklyn)
Draw the density map for Staten Island:
area_map_obj = get_map("Staten Island NYC", zoom = 12, maptype = "toner-background",
source = "stamen")
t_staten_island = tb %>% filter(Borough == "STATEN ISLAND", Occurrence.Year ==
"2015") %>% mutate(lon = loc_y, lat = loc_x)
draw_density_map_of_offenses_for_borough(area_map_obj, t_staten_island)